{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [
"# COMPSCI 389: Introduction to Machine Learning\n",
"# Nearest Neighbor Variants\n",
"\n",
"In this notebook we present a few improved nearest neighbor algorithms.\n",
"\n",
"**Note:** This notebook is described in the slides, `5.0 Nearest Neighbor Variants.pdf`. All of the important content within this notebook is in those slides, so you are not responsible for this notebook. However, you may reference this notebook to run the examples from the slides.\n",
"\n",
"Recall from last time:"
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [
"import pandas as pd\n",
"from sklearn.neighbors import KDTree\n",
"from sklearn.base import BaseEstimator\n",
"import numpy as np\n",
"\n",
"class NearestNeighbor(BaseEstimator):\n",
"    def fit(self, X, y):\n",
"        # Convert X and y to NumPy arrays if they are DataFrames. This makes fit compatible with numpy arrays or DataFrames\n",
"        if isinstance(X, pd.DataFrame):\n",
"            X = X.values\n",
"        if isinstance(y, pd.Series):\n",
"            y = y.values\n",
"\n",
"        # Store the training data and labels.\n",
"        self.X_data = X\n",
"        self.y_data = y\n",
"\n",
"        # Create a KDTree for efficient nearest neighbor search\n",
"        self.tree = KDTree(X)\n",
"\n",
"        return self\n",
"\n",
"    def predict(self, X):\n",
"        # Convert X to a NumPy array if it's a DataFrame\n",
"        if isinstance(X, pd.DataFrame):\n",
"            X = X.values\n",
"\n",
"        # We will iteratively load predictions, so it starts empty\n",
"        predictions = []\n",
"\n",
"        # Loop over rows in the query\n",
"        for x in X:\n",
"            # Query the tree for the nearest neighbor\n",
"            dist, ind = self.tree.query([x], k=1)\n",
"            nearest_label = self.y_data[ind[0][0]]\n",
"            predictions.append(nearest_label)\n",
"\n",
"        # Return the array of predictions we have created\n",
"        return np.array(predictions)"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
"## Nearest Neighbors Properties\n",
"\n",
"## 
Pros\n", "- Very simple\n", "- Sometimes all you need!\n", "- Efficient. $O(\\log(n))$ average case, $O(n)$ worst case.\n", " - Note: If you aren't familiar with this big-O notation, which is covered in COMPSCI 311, don't worry. You will not be tested on it.\n", "\n", "## Cons\n", "- Not always accurate, even with large amounts of data.\n", "\n", "Consider the following case:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# This code creates an image of interest. The code is not worth studying, just running.\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "np.random.seed(1)\n", "\n", "# Generate sample data\n", "x = np.ones(10) # All x-values are 1\n", "y = np.random.normal(5, 1, len(x)) # y-values are normally distributed around 5 with some noise\n", "\n", "# Create the plot\n", "plt.figure(figsize=(8, 6))\n", "plt.scatter(x, y, color='blue', label='Data Points')\n", "plt.axvline(x=1, color='red', linestyle='--', label='x = 1')\n", "\n", "# Adding annotations and labels for clarity\n", "plt.title(\"Data Points with Identical x-value and Noisy Labels\")\n", "plt.xlabel(\"x-axis\")\n", "plt.ylabel(\"y-axis\")\n", "plt.legend()\n", "plt.grid(True)\n", "\n", "# Show the plot\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Consider these 10 ten data points when the query is $x=1$. All of these points are zero distance from the query. Which one should be returned by nearest neighbor? Even as we get more and more data, our predictions given $x=1$ (in this case) will have high variance depending on which point we select.\n", "\n", "**Question**: What should we do in this case?\n", "\n", "**Answer**: Take the average of the labels (or perhaps the median).\n", "\n", "So, we can update our algorithm to break ties by averaging the labels (in the regression setting). However, consider this case:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# This code creates an image of interest. The code is not worth studying, just running.\n", "x = np.linspace(0.9, 1.1, 10)\n", "\n", "# Create the plot\n", "plt.figure(figsize=(8, 6))\n", "plt.scatter(x, y, color='blue', label='Data Points')\n", "plt.axvline(x=1, color='red', linestyle='--', label='x = 1')\n", "plt.xlim(0, 2)\n", "\n", "# Adding annotations and labels for clarity\n", "plt.title(\"Data Points Near x = 1 with Noisy Labels\")\n", "plt.xlabel(\"x-axis\")\n", "plt.ylabel(\"y-axis\")\n", "plt.legend()\n", "plt.grid(True)\n", "\n", "# Show the plot\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Imagine that the blue points are just some of the nearby points, and that other points span the x-axis from -100 to 100. What should we do in this case? There is no tie, but there are many points with x-values close to $1.0$ (the query). \n", "\n", "## k-Nearest Neighbors\n", "\n", "**Idea**: Average the labels of the $k$ nearest points, where $k$ is an integer hyperparameter.\n", "\n", "Note: A **hyperparameter** of an ML algorithm is a variable, like $k$, that changes the behavior of the algorithm, and which is often set by the data scientist applying the algorithm.\n", "\n", "This updated algorithm is called **k-Nearest Neighbor** and has the following pseudocode:\n", "\n", "1. Find the $k$ inputs closest to the query.\n", "2. Return the average of the labels for these $k$ closest inputs." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [
"class KNearestNeighbors(BaseEstimator):\n",
"    # Add a constructor that stores the value of k (a hyperparameter)\n",
"    def __init__(self, k=3):\n",
"        self.k = k\n",
"\n",
"    def fit(self, X, y):\n",
"        # Convert X and y to NumPy arrays if they are DataFrames\n",
"        if isinstance(X, pd.DataFrame):\n",
"            X = X.values\n",
"        if isinstance(y, pd.Series):\n",
"            y = y.values\n",
"\n",
"        # Store the training data and labels\n",
"        self.X_data = X\n",
"        self.y_data = y\n",
"\n",
"        # Create a KDTree for efficient nearest neighbor search\n",
"        self.tree = KDTree(X)\n",
"\n",
"        return self\n",
"\n",
"    def predict(self, X):\n",
"        # Convert X to a NumPy array if it's a DataFrame\n",
"        if isinstance(X, pd.DataFrame):\n",
"            X = X.values\n",
"\n",
"        # Query the tree for the k nearest neighbors for all points in X\n",
"        dist, ind = self.tree.query(X, k=self.k)\n",
"\n",
"        # Return the average label for the nearest neighbors of each query\n",
"        return np.mean(self.y_data[ind], axis=1)"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
"Let's compare the performance of k-NN to NN using the metrics that we discussed earlier:"
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [
"def mean_squared_error(predictions, labels):\n",
"    return np.mean((predictions - labels) ** 2)\n",
"\n",
"def root_mean_squared_error(predictions, labels):\n",
"    return np.sqrt(mean_squared_error(predictions, labels))\n",
"\n",
"def mean_absolute_error(predictions, labels):\n",
"    return np.mean(np.abs(predictions - labels))\n",
"\n",
"def r_squared(predictions, labels):\n",
"    ss_res = np.sum((labels - predictions) ** 2) # ss_res is the \"Sum of Squares of Residuals\"\n",
"    ss_tot = np.sum((labels - np.mean(labels)) ** 2) # ss_tot is the \"Total Sum of Squares\"\n",
"    return 1 - (ss_res / ss_tot)"
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": 
[], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "# Load the data set\n", "df = pd.read_csv(\"data/GPA.csv\", delimiter=',')\n", "\n", "# We already loaded X and y, but do it again as a reminder\n", "X = df.iloc[:, :-1]\n", "y = df.iloc[:, -1]\n", "\n", "# Split the data into training and testing sets (60% train, 40% test)\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, shuffle=True)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"  <thead>\n",
"    <tr><th></th><th>k</th><th>MSE</th><th>RMSE</th><th>MAE</th><th>R^2</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>1.152084</td><td>1.073352</td><td>0.823743</td><td>-0.687769</td></tr>\n",
"    <tr><th>1</th><td>2</td><td>0.853430</td><td>0.923813</td><td>0.713553</td><td>-0.250249</td></tr>\n",
"    <tr><th>2</th><td>3</td><td>0.764468</td><td>0.874339</td><td>0.678162</td><td>-0.119923</td></tr>\n",
"    <tr><th>3</th><td>5</td><td>0.688330</td><td>0.829657</td><td>0.644951</td><td>-0.008384</td></tr>\n",
"    <tr><th>4</th><td>10</td><td>0.631001</td><td>0.794356</td><td>0.620237</td><td>0.075602</td></tr>\n",
"    <tr><th>5</th><td>100</td><td>0.579404</td><td>0.761186</td><td>0.596919</td><td>0.151190</td></tr>\n",
"    <tr><th>6</th><td>1000</td><td>0.581676</td><td>0.762677</td><td>0.600227</td><td>0.147861</td></tr>\n",
"    <tr><th>7</th><td>5000</td><td>0.600544</td><td>0.774947</td><td>0.616670</td><td>0.120221</td></tr>\n",
"  </tbody>\n",
"</table>
" ], "text/plain": [ " k MSE RMSE MAE R^2\n", "0 1 1.152084 1.073352 0.823743 -0.687769\n", "1 2 0.853430 0.923813 0.713553 -0.250249\n", "2 3 0.764468 0.874339 0.678162 -0.119923\n", "3 5 0.688330 0.829657 0.644951 -0.008384\n", "4 10 0.631001 0.794356 0.620237 0.075602\n", "5 100 0.579404 0.761186 0.596919 0.151190\n", "6 1000 0.581676 0.762677 0.600227 0.147861\n", "7 5000 0.600544 0.774947 0.616670 0.120221" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# List of values of k to test\n", "k_values = [1, 2, 3, 5, 10, 100, 1000, 5000]\n", "\n", "# List to store the results. This will be a list of dictionaries\n", "results_list = []\n", "\n", "# Evaluate NN and k-NN models\n", "for k in k_values:\n", " model = KNearestNeighbors(k=k)\n", " model.fit(X_train, y_train)\n", " predictions = model.predict(X_test)\n", "\n", " mse = mean_squared_error(predictions, y_test)\n", " rmse = root_mean_squared_error(predictions, y_test)\n", " mae = mean_absolute_error(predictions, y_test)\n", " r2 = r_squared(predictions, y_test)\n", "\n", " # Create a dictionary with the relevant variables from this value of k, and add it to results_list.\n", " results_list.append({'k': k, 'MSE': mse, 'RMSE': rmse, 'MAE': mae, 'R^2': r2})\n", "\n", "# Create DataFrame from the list of results. Each dictionary in the list becomes a row in the DataFrame and the keys of the dictionaries become the column headers.\n", "results = pd.DataFrame(results_list)\n", "\n", "# Print the results\n", "display(results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make this easier on the eyes, let's highlight the minimum values in bold. 
The below requires the `jinja2` library:\n", "> pip install jinja2" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
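<table>\n",
"  <thead>\n",
"    <tr><th></th><th>k</th><th>MSE</th><th>RMSE</th><th>MAE</th><th>R^2</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>1.152084</td><td>1.073352</td><td>0.823743</td><td>-0.687769</td></tr>\n",
"    <tr><th>1</th><td>2</td><td>0.853430</td><td>0.923813</td><td>0.713553</td><td>-0.250249</td></tr>\n",
"    <tr><th>2</th><td>3</td><td>0.764468</td><td>0.874339</td><td>0.678162</td><td>-0.119923</td></tr>\n",
"    <tr><th>3</th><td>5</td><td>0.688330</td><td>0.829657</td><td>0.644951</td><td>-0.008384</td></tr>\n",
"    <tr><th>4</th><td>10</td><td>0.631001</td><td>0.794356</td><td>0.620237</td><td>0.075602</td></tr>\n",
"    <tr><th>5</th><td>100</td><td><b>0.579404</b></td><td><b>0.761186</b></td><td><b>0.596919</b></td><td><b>0.151190</b></td></tr>\n",
"    <tr><th>6</th><td>1000</td><td>0.581676</td><td>0.762677</td><td>0.600227</td><td>0.147861</td></tr>\n",
"    <tr><th>7</th><td>5000</td><td>0.600544</td><td>0.774947</td><td>0.616670</td><td>0.120221</td></tr>\n",
"  </tbody>\n",
"</table>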
\n" ], "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [
"# Function to apply bold style to the minimum value in each column, except for R^2 where we highlight the maximum\n",
"def highlight_extreme(s):\n",
"    if s.name == 'R^2':\n",
"        is_extreme = s == s.max()\n",
"    else:\n",
"        is_extreme = s == s.min()\n",
"    return ['font-weight: bold' if v else '' for v in is_extreme]\n",
"\n",
"# Apply the styling\n",
"styled_results = results.style.apply(highlight_extreme, subset=['MSE', 'RMSE', 'MAE', 'R^2'])\n",
"\n",
"# Display the styled DataFrame\n",
"styled_results"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
"## Weighted k-Nearest Neighbors\n",
"\n",
"What if, when running k-NN, some of the $k$ nearest points are very close to the query point and others are actually quite far away? Our method weights these points equally when making its prediction.\n",
"\n",
"**Idea**: Assign different weights to each of the $k$ neighbors based on their distance from the query point.\n",
"\n",
"This ensures that closer neighbors have a bigger influence on the prediction than neighbors that are farther away.\n",
"\n",
"Let $(x^\\text{NN}_i, y^\\text{NN}_i)$ be the $i^\\text{th}$ nearest neighbor.\n",
"\n",
"Let $w_i$ be the weight associated with the point $(x^\\text{NN}_i, y^\\text{NN}_i)$. We will consider only non-negative weights, i.e., $w_i \\geq 0$. Soon we will describe how to compute $w_i$.\n",
"\n",
"The weighted k-NN prediction is then:\n",
"$$\n",
"\\hat y = \\frac{\\sum_{i=1}^k w_i \\, y^\\text{NN}_i}{\\sum_{j=1}^k w_j}\n",
"$$\n",
"which is equivalent to\n",
"$$\n",
"\\hat y = \\sum_{i=1}^k \\frac{w_i}{\\sum_{j=1}^k w_j} \\, y^\\text{NN}_i.\n",
"$$\n",
"\n",
"To see why we divide by the sum of the weights, consider the case where $k=2$ and $w_1=w_2=1$. 
In this case, if we didn't divide by the sum of the weights the prediction would be $y_1^\\text{NN} + y_2^\\text{NN}$, which will be roughly two times too big. Dividing by the sum of the weights makes the weights sum to one, and results in a **weighted average** of the labels." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Weighting Options\n", "\n", "There are several methods to assign weights in weighted k-NN. \n", "\n", "**Question**: Would it be reasonable to use $w_i=\\operatorname{dist}(x_i^\\text{NN}, x_\\text{query})$?\n", "\n", "**Answer**: No, this would place larger weights on points farther from the query." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One option would be to use the inverse of the distance as the weight:\n", "$$\n", "w_i = \\frac{1}{\\operatorname{dist}(x_i^\\text{NN}, x_\\text{query})}.\n", "$$\n", "However, we might want the weight to decrease faster for points that are farther away. Consider the following bell-curve shape mapping distances to weights:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "# Gaussian Kernel Function\n", "def gaussian_kernel(distance, sigma=1.0):\n", " return np.exp(- (distance ** 2) / (2 * sigma ** 2))\n", "\n", "# Generate distance values from 0 to 3\n", "distances = np.linspace(0, 3, 100)\n", "\n", "# Compute weights using the Gaussian kernel\n", "weights = gaussian_kernel(distances)\n", "\n", "# Plotting\n", "plt.figure(figsize=(8, 4))\n", "plt.plot(distances, weights)\n", "plt.xlabel(r'dist$(x_i^{\\text{NN}}, x_{\\text{query}})$') # Use raw string so we don't have to escape the backslashes in LaTeX\n", "plt.ylabel(r'$w_i$')\n", "plt.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This function is called the **Gaussian kernel**. It is the (re-scaled) probability density function of a normal distribution:\n", "$$\n", "f(x)=\\frac{1}{\\sigma \\sqrt{2\\pi}}e^{-\\frac{(x-\\mu)^2}{2\\sigma^2}},\n", "$$\n", "with mean $\\mu=0$ and standard deviation $\\sigma=1$.\n", "\n", "Because we will divide all weights by the sum of the weights, scaling each weight by the same constant has no impact on the final weighting. 
So, we can drop the $\\frac{1}{\\sigma \\sqrt{2\\pi}}$ term and write the Gaussian kernel as:\n", "$$\n", "e^{-\\frac{(x-\\mu)^2}{2\\sigma^2}}.\n", "$$\n", "We will apply this with $\\mu=0$ and with $x$ corresponding to the distance between the query and point, so we can write this as:\n", "$$\n", "e^{-\\frac{\\operatorname{dist}(x_i^\\text{NN}, x_\\text{query})^2}{2\\sigma^2}}.\n", "$$\n", "\n", "Cleaning this up, let:\n", "$$\n", "d_i = \\operatorname{dist}(x_i^\\text{NN}, x_\\text{query}),\n", "$$\n", "and\n", "$$\n", "w_i = e^{-\\frac{d_i^2}{2\\sigma^2}}.\n", "$$\n", "\n", "We can then plug this weight into the weighted k-NN prediction equation:\n", "$$\n", "\\hat y = \\frac{\\sum_{i=1}^k w_i \\, y^\\text{NN}_i}{\\sum_{i=1}^k w_i}.\n", "$$\n", "\n", "Before we experiment with this new NN variant, let's think more about what $\\sigma$ (another hyperparameter) does.\n", "\n", "**Question**: What is the impact of a bigger value for $\\sigma$?\n", "\n", "**Answer**: A larger value of sigma makes the weight curve wider, placing larger weights on points that are farther away. A smaller value of sigma makes the weight curve tighter, placing more emphasis on points that are closer. We can see this in the following plot:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plotting for different sigma values\n", "plt.figure(figsize=(8, 4))\n", "\n", "for sigma in [0.1, 1, 10]:\n", " weights = gaussian_kernel(distances, sigma)\n", " plt.plot(distances, weights, label=rf'$\\sigma = {sigma}$') # LaTeX formatting for sigma\n", "\n", "\n", "plt.xlabel(r'dist$(x_i^{\\text{NN}}, x_{\\text{query}})$') # Use raw string for LaTeX\n", "plt.ylabel(r'$w_i$')\n", "plt.title('Gaussian Kernel with Different Sigma Values')\n", "plt.legend()\n", "plt.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ok, let's code up the weighted k-NN algorithm!" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.neighbors import KDTree\n", "from sklearn.base import BaseEstimator\n", "import numpy as np\n", "\n", "class WeightedKNearestNeighbors(BaseEstimator):\n", " # Add a constructor that stores the value of k and sigma (hyperparameters)\n", " def __init__(self, k=3, sigma=1.0):\n", " self.k = k\n", " self.sigma = sigma\n", "\n", " def fit(self, X, y):\n", " # Convert X and y to NumPy arrays if they are DataFrames\n", " if isinstance(X, pd.DataFrame):\n", " X = X.values\n", " if isinstance(y, pd.Series):\n", " y = y.values\n", "\n", " # Store the training data and labels\n", " self.X_data = X\n", " self.y_data = y\n", " \n", " # Create a KDTree for efficient nearest neighbor search\n", " self.tree = KDTree(X)\n", "\n", " return self\n", "\n", " def gaussian_kernel(self, distance):\n", " # Gaussian kernel function\n", " return np.exp(- (distance ** 2) / (2 * self.sigma ** 2))\n", "\n", " def predict(self, X):\n", " # Convert X to a NumPy array if it's a DataFrame\n", " if isinstance(X, pd.DataFrame):\n", " X = X.values\n", "\n", " # We will iteratively load predictions, so it starts empty\n", " predictions = []\n", " \n", " # Loop over rows in the query\n", " for x in X:\n", " # Query the tree 
for the k nearest neighbors\n", " dist, ind = self.tree.query([x], k=self.k)\n", "\n", " # Calculate weights using the Gaussian kernel\n", " weights = self.gaussian_kernel(dist[0])\n", "\n", " # Check if weights sum to zero. This happens when all points are very far, giving weights that round to zero, causing divison by zero later. In this case, revert to un-weighted (all weights are one).\n", " if np.sum(weights) == 0:\n", " # If weights sum to zero, assign equal weight to all neighbors\n", " weights = np.ones_like(weights)\n", "\n", " # Weighted average of the labels of the k nearest neighbors\n", " weighted_avg_label = np.average(self.y_data[ind[0]], weights=weights)\n", " predictions.append(weighted_avg_label)\n", "\n", " # Return the array of predictions we have created\n", " return np.array(predictions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's add this to our comparison of methods:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
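<table>\n",
"  <thead>\n",
"    <tr><th></th><th>Model</th><th>MSE</th><th>RMSE</th><th>MAE</th><th>R^2</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>k-NN k=1 sigma=None</td><td>1.152084</td><td>1.073352</td><td>0.823743</td><td>-0.687769</td></tr>\n",
"    <tr><th>1</th><td>k-NN k=100 sigma=None</td><td>0.579404</td><td>0.761186</td><td>0.596919</td><td>0.151190</td></tr>\n",
"    <tr><th>2</th><td>k-NN k=100 sigma=100</td><td>0.579572</td><td>0.761297</td><td>0.596952</td><td>0.150943</td></tr>\n",
"    <tr><th>3</th><td>k-NN k=200 sigma=100</td><td>0.577554</td><td>0.759970</td><td><b>0.596220</b></td><td>0.153901</td></tr>\n",
"    <tr><th>4</th><td>k-NN k=300 sigma=100</td><td><b>0.577443</b></td><td><b>0.759897</b></td><td>0.596408</td><td><b>0.154062</b></td></tr>\n",
"    <tr><th>5</th><td>k-NN k=400 sigma=100</td><td>0.577620</td><td>0.760013</td><td>0.596670</td><td>0.153804</td></tr>\n",
"    <tr><th>6</th><td>k-NN k=500 sigma=100</td><td>0.578077</td><td>0.760314</td><td>0.597044</td><td>0.153135</td></tr>\n",
"  </tbody>\n",
"</table>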
\n" ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [
"# Model parameters to test\n",
"parameters = [\n",
"    {\"k\": 1, \"sigma\": None}, # Standard k-NN\n",
"    {\"k\": 100, \"sigma\": None}, # Standard k-NN\n",
"    {\"k\": 100, \"sigma\": 100},\n",
"    {\"k\": 200, \"sigma\": 100},\n",
"    {\"k\": 300, \"sigma\": 100},\n",
"    {\"k\": 400, \"sigma\": 100},\n",
"    {\"k\": 500, \"sigma\": 100}\n",
"]\n",
"\n",
"# List to store results (one dictionary per model)\n",
"results = []\n",
"\n",
"# Training and evaluating each model\n",
"for param in parameters:\n",
"    # Determine which model to use. Notice that we can use k-NN for NN by setting k=1\n",
"    if param[\"sigma\"] is None:\n",
"        model = KNearestNeighbors(k=param[\"k\"])\n",
"    else:\n",
"        model = WeightedKNearestNeighbors(k=param[\"k\"], sigma=param[\"sigma\"])\n",
"\n",
"    # Train the model and get the predictions on the test set\n",
"    model.fit(X_train, y_train)\n",
"    predictions = model.predict(X_test)\n",
"\n",
"    mse = mean_squared_error(predictions, y_test)\n",
"    rmse = root_mean_squared_error(predictions, y_test)\n",
"    mae = mean_absolute_error(predictions, y_test)\n",
"    r2 = r_squared(predictions, y_test)\n",
"\n",
"    results.append({\"Model\": f\"k-NN k={param['k']} sigma={param['sigma']}\", \n",
"                    \"MSE\": mse, \"RMSE\": rmse, \"MAE\": mae, \"R^2\": r2})\n",
"\n",
"# Creating DataFrame for results\n",
"results_df = pd.DataFrame(results)\n",
"\n",
"# Finding the row index of the best (minimum or maximum) value for each metric\n",
"best_metrics = {\n",
"    \"MSE\": results_df['MSE'].idxmin(),\n",
"    \"RMSE\": results_df['RMSE'].idxmin(),\n",
"    \"MAE\": results_df['MAE'].idxmin(),\n",
"    \"R^2\": results_df['R^2'].idxmax()\n",
"}\n",
"\n",
"# Highlighting the best values in the DataFrame\n",
"def highlight_best(row, best_metrics):\n",
"    return ['font-weight: bold' if (col in best_metrics and row.name == best_metrics[col]) else '' for col in row.index]\n",
"\n",
"# Apply the highlighting\n",
"styled_results = results_df.style.apply(highlight_best, best_metrics=best_metrics, axis=1)\n", "styled_results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, it doesn't make a big difference. We will re-visit this later, seeing more significant improvements for other data sets.\n", "\n", "**Question**: How can the nearest neighbor algorithms be extended to the classification setting?\n", "\n", "**Answer (k-NN)**: The most common method is to use a majority vote among the k nearest neighbors. That is, whichever label is most common among the nearest neighbors is selected. \n", "\n", "**Answer (weighted k-NN)**: Each neighbor's vote is weighted in the vote." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Tuning Hyperparameters\n", "\n", "Let we will discuss the tuning of hyperparameters more. For now, notice that this is often more of an art than a science. Over time, you become more familiar with how different parameters change the behavior of algorithms and learn to guess changes to hyperparameters that could be more effective.\n", "\n", "In this case, let's run a **grid search** over values of k and $\\sigma$ to see which are most effective.\n", "\n", "A **grid search** for hyperparameters involves training an evaluating a model exhaustively over a specified range of hyperparameter values, with the aim of identifying the combination of parameters that results in the best performance. This process involves creating and testing every possible combination of the provided parameter values." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from mpl_toolkits.mplot3d import Axes3D\n",
"import pickle\n",
"import os\n",
"\n",
"# Path to the cache file\n",
"cache_file = 'cache/Lecture_5/hyperparamSearch.pkl'\n",
"re_run = False # Set to True to force re-running the grid search\n",
"\n",
"# Run the grid search if the cache file does not exist or re_run is True\n",
"if not os.path.exists(cache_file) or re_run:\n",
"    # Define the ranges for k and sigma\n",
"    k_values = [k for k in range(100, 1100, 100)]\n",
"    sigma_values = [20, 50, 75, 100, 200, 400, 600]\n",
"\n",
"    # Initialize matrix to store R^2 values\n",
"    r2_values = np.zeros((len(k_values), len(sigma_values)))\n",
"\n",
"    # Grid search\n",
"    for i, k in enumerate(k_values):\n",
"        for j, sigma in enumerate(sigma_values):\n",
"            model = WeightedKNearestNeighbors(k=k, sigma=sigma)\n",
"            model.fit(X_train, y_train)\n",
"            predictions = model.predict(X_test)\n",
"\n",
"            # Compute R^2 value\n",
"            r2 = r_squared(predictions, y_test)\n",
"            r2_values[i, j] = r2\n",
"\n",
"    # Make sure the cache directory exists, then save the results to a pickle file\n",
"    os.makedirs(os.path.dirname(cache_file), exist_ok=True)\n",
"    with open(cache_file, 'wb') as f:\n",
"        pickle.dump({'r2_values': r2_values, 'k_values': k_values, 'sigma_values': sigma_values}, f)\n",
"else:\n",
"    # Load the results from the pickle file\n",
"    with open(cache_file, 'rb') as f:\n",
"        data = pickle.load(f)\n",
"    r2_values = data['r2_values']\n",
"    k_values = data['k_values']\n",
"    sigma_values = data['sigma_values']"
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "\n", "# Assuming r2_values, sigma_values, and k_values are defined\n", "\n", "fig, ax = plt.subplots()\n", "# Create a heatmap with specified value range\n", "sns.heatmap(r2_values, annot=True, fmt=\".3f\", cmap='viridis', \n", " xticklabels=sigma_values, yticklabels=k_values, ax=ax, \n", " vmin=0.15, vmax=0.154)\n", "\n", "# Set axis labels\n", "ax.set_xlabel('Sigma')\n", "ax.set_ylabel('k')\n", "ax.set_title('Heatmap of R^2 for Different k and sigma')\n", "\n", "plt.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question**: Why do you think a value of k around 200 is particularly effective for this problem?\n", "\n", "**Question**: Why do you think changing sigma makes little difference when it is large?\n", "\n", "As you work with each ML algorithm, you'll start to get a sense for how to set the different hyperparameters, and how to tune them manually." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.7" } }, "nbformat": 4, "nbformat_minor": 2 }